Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks
Human action recognition in 3D skeleton sequences has attracted a lot of
research attention. Recently, Long Short-Term Memory (LSTM) networks have shown
promising performance in this task due to their strengths in modeling the
dependencies and dynamics in sequential data. As not all skeletal joints are
informative for action recognition, and the irrelevant joints often bring noise
which can degrade the performance, we need to pay more attention to the
informative ones. However, the original LSTM network does not have explicit
attention ability. In this paper, we propose a new class of LSTM network,
Global Context-Aware Attention LSTM (GCA-LSTM), for skeleton-based action
recognition. This network is capable of selectively focusing on the informative
joints in each frame of each skeleton sequence by using a global context memory
cell. To further improve the attention capability of our network, we also
introduce a recurrent attention mechanism, with which the attention performance
of the network can be enhanced progressively. Moreover, we propose a stepwise
training scheme in order to train our network effectively. Our approach
achieves state-of-the-art performance on five challenging benchmark datasets
for skeleton-based action recognition.
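The idea of attending to informative joints with a global context, and refining that context over several passes, can be illustrated with a minimal sketch. This is not the GCA-LSTM architecture itself: the dot-product scoring, the function names (`attend_joints`, `refine_context`), and the use of plain feature lists instead of LSTM hidden states are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_joints(joint_feats, context):
    """Weight each joint by its similarity to a global context vector.

    Assumption: similarity is a plain dot product; the paper learns
    its scoring function inside the network.
    """
    scores = [sum(f * c for f, c in zip(feat, context)) for feat in joint_feats]
    weights = softmax(scores)
    dim = len(context)
    # Attention-weighted sum of joint features.
    pooled = [sum(w * feat[d] for w, feat in zip(weights, joint_feats))
              for d in range(dim)]
    return weights, pooled

def refine_context(joint_feats, context, iterations=2):
    """Hypothetical recurrent refinement: reuse the pooled features
    as the next global context for a fixed number of passes."""
    for _ in range(iterations):
        _, context = attend_joints(joint_feats, context)
    return context
```

Calling `attend_joints([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])` gives the first joint the larger weight, since its features align with the context vector.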
Skeleton-based Relational Reasoning for Group Activity Analysis
Research on group activity recognition mostly relies on the standard
two-stream approach (RGB and optical flow) for its input features. Few works have
explored explicit pose information, and none has used it directly to reason about
the interactions between persons. In this paper, we leverage skeleton information
to learn the interactions between individuals directly. With our
proposed method, GIRN, multiple relationship types are inferred by independent
modules that describe the relations between body joints pair by pair.
In addition to the joint relations, we also experiment with the previously
unexplored relationship between individuals and relevant objects (e.g.
volleyball). The individuals' distinct relations are then merged through an
attention mechanism that gives more importance to the individuals more
relevant for distinguishing the group activity. We evaluate our method on the
Volleyball dataset, obtaining results competitive with the state of the art. Our
experiments demonstrate the potential of skeleton-based approaches for modeling
multi-person interactions.
Comment: 26 pages, 5 figures, accepted manuscript in Elsevier Pattern
Recognition, minor writing revisions and new reference
Heterogeneous Domain Generalization via Domain Mixup
One of the main drawbacks of deep Convolutional Neural Networks (DCNN) is
that they lack generalization capability. In this work, we focus on the problem
of heterogeneous domain generalization, which aims to improve generalization
capability across different tasks: that is, how to learn a DCNN model with
multiple-domain data such that the trained feature extractor can be generalized
to support recognition of novel categories in a novel target domain. To
solve this problem, we propose a novel heterogeneous domain generalization
method that mixes up samples across multiple source domains with two different
sampling strategies. Our experimental results on the Visual Decathlon
benchmark demonstrate the effectiveness of our proposed method. The code is
released at https://github.com/wyf0912/MIXALL
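Mixing samples across source domains can be sketched with the standard mixup interpolation. The following is a rough illustration, not the released MIXALL code: the sampling rule (two distinct domains chosen uniformly), the shared label index space of size `num_classes`, and the function name `mixup_across_domains` are assumptions, and the abstract's two sampling strategies are not reproduced here.

```python
import random

def mixup_across_domains(domains, num_classes, alpha=1.0, rng=None):
    """Mix one sample from each of two different source domains.

    domains: list of domains, each a list of (features, label) pairs,
             where label is an int index into a shared label space
             (an assumption made for this sketch).
    Returns the mixed features, the soft label, and the mixing weight.
    """
    rng = rng or random.Random()
    # Pick two distinct source domains, then one sample from each.
    d_a, d_b = rng.sample(range(len(domains)), 2)
    x_a, y_a = rng.choice(domains[d_a])
    x_b, y_b = rng.choice(domains[d_b])
    # Standard mixup: the mixing weight is drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    # The soft label is the same convex combination of one-hot labels.
    y = [0.0] * num_classes
    y[y_a] += lam
    y[y_b] += 1.0 - lam
    return x, y, lam
```

The mixed label always sums to one, so it can be used directly with a soft-target cross-entropy loss.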
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis has achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including a lack of large-scale
training samples, a realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI)
Multi-Domain Adversarial Feature Generalization for Person Re-Identification
With the assistance of sophisticated training methods applied to single
labeled datasets, the performance of fully-supervised person re-identification
(Person Re-ID) has been improved significantly in recent years. However, these
models trained on a single dataset usually suffer from considerable performance
degradation when applied to videos of a different camera network. To make
Person Re-ID systems more practical and scalable, several cross-dataset domain
adaptation methods have been proposed, which achieve high performance without
the labeled data from the target domain. However, these approaches still
require the unlabeled data of the target domain during the training process,
making them impractical. A practical Person Re-ID system pre-trained on other
datasets should start running immediately after deployment on a new site
without having to wait until sufficient images or videos are collected and the
pre-trained model is tuned. To serve this purpose, in this paper, we
reformulate person re-identification as a multi-dataset domain generalization
problem. We propose a multi-dataset feature generalization network (MMFA-AAE),
which is capable of learning a universal domain-invariant feature
representation from multiple labeled datasets and generalizing it to 'unseen'
camera systems. The network is based on an adversarial auto-encoder to learn a
generalized domain-invariant latent feature representation with the Maximum
Mean Discrepancy (MMD) measure to align the distributions across multiple
domains. Extensive experiments demonstrate the effectiveness of the proposed
method. Our MMFA-AAE approach not only outperforms most of the domain
generalization Person Re-ID methods, but also surpasses many state-of-the-art
supervised methods and unsupervised domain adaptation methods by a large
margin.
Comment: TIP (Accept with Mandatory Minor Revisions)
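The Maximum Mean Discrepancy measure mentioned above compares two feature distributions through kernel averages. Below is a minimal sketch of the biased estimate of squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth `sigma`, and the function names are assumptions, and this is not the MMFA-AAE training code.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    """
    def mean_kernel(p, q):
        total = sum(rbf_kernel(u, v, sigma) for u in p for v in q)
        return total / (len(p) * len(q))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when both sample sets are identical and grows as the sets move apart, which is why minimizing it across domains encourages aligned feature distributions.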